Privacy Preservation by Disassociation
In this work, we focus on protection against identity disclosure in the
publication of sparse multidimensional data. Existing multidimensional
anonymization techniques (a) protect the privacy of users either by altering the
set of quasi-identifiers of the original data (e.g., by generalization or
suppression) or by adding noise (e.g., using differential privacy) and/or (b)
assume a clear distinction between sensitive and non-sensitive information and
sever the possible linkage. In many real-world applications the above
techniques are not applicable. For instance, consider web search query logs.
Suppressing or generalizing anonymization methods would remove the most
valuable information in the dataset: the original query terms. Additionally,
web search query logs contain millions of query terms which cannot be
categorized as sensitive or non-sensitive since a term may be sensitive for a
user and non-sensitive for another. Motivated by this observation, we propose
an anonymization technique termed disassociation that preserves the original
terms but hides the fact that two or more different terms appear in the same
record. We protect the users' privacy by disassociating record terms that
participate in identifying combinations. This way, the adversary cannot
associate, with high probability, a record with a rare combination of terms. To
the best of our knowledge, our proposal is the first to employ such a technique
to provide protection against identity disclosure. We propose an anonymization
algorithm based on our approach and evaluate its performance on real and
synthetic datasets, comparing it against other state-of-the-art methods based
on generalization and differential privacy.
Comment: VLDB201
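The core idea above, keeping the original terms while hiding that rare term combinations co-occur in a single record, can be illustrated with a toy sketch. This is not the paper's algorithm; the function name, chunking strategy, and the frequency threshold k are illustrative assumptions:

```python
from collections import Counter
from itertools import combinations

def disassociate(records, k=2):
    """Toy disassociation sketch: split each record (a list of terms) into
    chunks so that every pair of terms co-occurring inside one chunk appears
    in at least k records overall. Rare (potentially identifying) pairs are
    thus separated into different chunks; all original terms are kept."""
    # Count how often each term pair co-occurs across all records.
    pair_counts = Counter()
    for rec in records:
        for pair in combinations(sorted(rec), 2):
            pair_counts[pair] += 1
    result = []
    for rec in records:
        chunks = []
        for term in rec:
            placed = False
            # Place the term in the first chunk where all resulting pairs
            # are frequent enough (co-occur in >= k records).
            for chunk in chunks:
                if all(pair_counts[tuple(sorted((term, t)))] >= k
                       for t in chunk):
                    chunk.append(term)
                    placed = True
                    break
            if not placed:
                chunks.append([term])   # start a new chunk for the term
        result.append(chunks)
    return result
```

For example, with records [["flu", "aspirin"], ["flu", "aspirin"], ["flu", "secret"]] and k=2, the rare pair ("flu", "secret") ends up split across two chunks, while the frequent pair stays together.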
Scalable Probabilistic Similarity Ranking in Uncertain Databases (Technical Report)
This paper introduces a scalable approach for probabilistic top-k similarity
ranking on uncertain vector data. Each uncertain object is represented by a set
of vector instances that are assumed to be mutually exclusive. The objective is
to rank the uncertain data according to their distance to a reference object.
We propose a framework that incrementally computes, for each object instance
and ranking position, the probability of the object falling at that ranking
position. The resulting rank probability distribution can serve as input for
several state-of-the-art probabilistic ranking models. Existing approaches
compute this probability distribution by applying a dynamic programming
approach of quadratic complexity. In this paper we theoretically as well as
experimentally show that our framework reduces this to a linear-time complexity
while having the same memory requirements, facilitated by incremental accessing
of the uncertain vector instances in increasing order of their distance to the
reference object. Furthermore, we show how the output of our method can be used
to apply probabilistic top-k ranking for the objects, according to different
state-of-the-art definitions. We conduct an experimental evaluation on
synthetic and real data, which demonstrates the efficiency of our approach.
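The quadratic dynamic-programming baseline that the abstract refers to can be sketched as a Poisson-binomial recurrence, assuming (for illustration only) that each competing object is closer to the reference object than the current instance with an independent probability. The paper's incremental framework improves on this baseline; the sketch below is not that linear-time method:

```python
def rank_distribution(p_closer):
    """Given p_closer[i] = probability that competing object i lies closer
    to the reference object than the current instance, return the
    distribution over the instance's (0-based) ranking position, i.e.
    dist[j] = P(exactly j competitors are closer). Standard DP baseline."""
    dist = [1.0]                       # with no competitors, rank 0 is certain
    for p in p_closer:
        new = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            new[j] += q * (1 - p)      # competitor is farther: rank unchanged
            new[j + 1] += q * p        # competitor is closer: rank shifts by 1
        dist = new
    return dist
```

With two competitors each closer with probability 0.5, the instance lands at ranks 0, 1, 2 with probabilities 0.25, 0.5, 0.25; the inner loop over all previous positions is what makes this baseline quadratic overall.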
T-Crowd: Effective Crowdsourcing for Tabular Data
Crowdsourcing employs human workers to solve computer-hard problems, such as
data cleaning, entity resolution, and sentiment analysis. When crowdsourcing
tabular data, e.g., the attribute values of an entity set, a worker's answers
on the different attributes (e.g., the nationality and age of a celebrity star)
are often treated independently. This assumption is not always true and can
lead to suboptimal crowdsourcing performance. In this paper, we present the
T-Crowd system, which takes into consideration the intricate relationships
among tasks, in order to converge faster to their true values. Particularly,
T-Crowd integrates each worker's answers on different attributes to effectively
learn his/her trustworthiness and the true data values. The attribute
relationship information is also used to guide task allocation to workers.
Finally, T-Crowd seamlessly supports categorical and continuous attributes,
which are the two main datatypes found in typical databases. Our extensive
experiments on real and synthetic datasets show that T-Crowd outperforms
state-of-the-art methods in terms of truth inference and reducing the cost of
crowdsourcing.
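The joint estimation of worker trustworthiness and true values can be sketched in an EM-style loop that alternates between trust-weighted voting and re-estimating each worker's trust from agreement with the current truths. This is a simplified single-attribute categorical sketch under assumed names, not T-Crowd's full model:

```python
from collections import Counter, defaultdict

def infer_truth(answers, iters=10):
    """answers: {worker: {task: label}}. Returns (truths, trust), where
    truths maps each task to its estimated true label and trust maps each
    worker to the fraction of their answers matching those truths."""
    trust = {w: 1.0 for w in answers}   # start by trusting everyone equally
    truths = {}
    for _ in range(iters):
        # Step 1: trust-weighted vote per task.
        votes = defaultdict(Counter)
        for w, tasks in answers.items():
            for t, label in tasks.items():
                votes[t][label] += trust[w]
        truths = {t: c.most_common(1)[0][0] for t, c in votes.items()}
        # Step 2: re-estimate each worker's trust as their accuracy
        # against the current truth estimates.
        for w, tasks in answers.items():
            correct = sum(truths[t] == lab for t, lab in tasks.items())
            trust[w] = correct / len(tasks) if tasks else 0.0
    return truths, trust
```

A worker who disagrees with the consensus on some tasks ends up with lower trust, so their future votes count less; T-Crowd additionally exploits correlations across attributes of the same entity, which this sketch omits.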
APRIL: Approximating Polygons as Raster Interval Lists
The spatial intersection join is an important spatial query operation, due to
its popularity and high complexity. The spatial join pipeline takes as input
two collections of spatial objects (e.g., polygons). In the filter step, pairs
of object MBRs that intersect are identified and passed to the refinement step
for verification of the join predicate on the exact object geometries. The
bottleneck of spatial join evaluation is in the refinement step. We introduce
APRIL, a powerful intermediate step in the pipeline, which is based on raster
interval approximations of object geometries. Our technique applies a sequence
of interval joins on 'intervalized' object approximations to determine whether
the objects intersect or not. Compared to previous work, APRIL approximations
are simpler, occupy much less space, and achieve similar pruning effectiveness
at a much higher speed. Besides intersection joins between polygons, APRIL can
be directly applied, with high effectiveness, to polygonal range queries,
within joins, and polygon-linestring joins. By applying a lightweight
compression technique, APRIL approximations may occupy even less space than
object MBRs. Furthermore, APRIL can be customized to apply on partitioned data
and on polygons of varying sizes, rasterized at different granularities. Our
last contribution is a novel algorithm that computes the APRIL approximation of
a polygon without having to rasterize it in full, which is orders of magnitude
faster than the computation of other raster approximations. Experiments on real
data demonstrate the effectiveness and efficiency of APRIL; compared to the
state-of-the-art intermediate filter, APRIL occupies 2x-8x less space, is
3.5x-8.5x more time-efficient, and reduces the end-to-end join cost up to 3
times.
Comment: 12 page
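The interval-join step over 'intervalized' object approximations can be sketched as a merge of two sorted interval lists over raster cell ids. This is a simplification under assumed names: APRIL additionally distinguishes fully from partially covered cells to prune true hits as well as true misses.

```python
def raster_intervals_intersect(a, b):
    """a, b: sorted, disjoint half-open intervals (start, end) over the cell
    ids of a common rasterization grid. Returns True if any cell is covered
    by both lists, via a single merge-style pass (linear in len(a)+len(b))."""
    i = j = 0
    while i < len(a) and j < len(b):
        a_s, a_e = a[i]
        b_s, b_e = b[j]
        if a_s < b_e and b_s < a_e:   # half-open intervals overlap
            return True
        # Advance the list whose current interval ends first.
        if a_e <= b_s:
            i += 1
        else:
            j += 1
    return False
```

Because both lists are sorted by cell id, each interval is visited at most once, which is one reason such interval-list filters can be much faster than comparing full raster approximations cell by cell.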